28 research outputs found

    Calibrated Fairness in Bandits

    Get PDF
    We study fairness within the stochastic, \emph{multi-armed bandit} (MAB) decision making framework. We adapt the fairness framework of "treating similar individuals similarly" to this setting. Here, an `individual' corresponds to an arm and two arms are `similar' if they have a similar quality distribution. First, we adopt a {\em smoothness constraint}: if two arms have a similar quality distribution, then the probability of selecting each arm should be similar. In addition, we define the {\em fairness regret}, which corresponds to the degree to which an algorithm is not calibrated, where perfect calibration requires that the probability of selecting an arm be equal to the probability with which the arm has the best quality realization. We show that a variation on Thompson sampling satisfies smooth fairness for total variation distance, and give an \tilde{O}((kT)^{2/3}) bound on fairness regret. This complements prior work, which protects an on-average better arm from being less favored. We also explain how to extend our algorithm to the dueling bandit setting. Comment: To be presented at the FAT-ML'17 workshop.
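    The calibration notion above, that an arm should be pulled with the probability that it actually has the best quality realization, is exactly what Bernoulli Thompson sampling approximates when arms are selected by posterior sampling. The sketch below only illustrates that property under assumed Bernoulli rewards and Beta priors; it is not the paper's fairness-regret algorithm, and all names are invented for the example.

```python
import numpy as np

def thompson_step(successes, failures, rng):
    """One round of Bernoulli Thompson sampling.

    Picking the arm with the largest posterior draw selects each arm with
    (approximately) the posterior probability that it is the best arm --
    the calibration property discussed in the abstract.  The paper's
    smoothed variant and its fairness-regret bound are not reproduced here.
    """
    # Beta(1 + successes, 1 + failures) posterior draw for each arm.
    samples = rng.beta(1 + successes, 1 + failures)
    return int(np.argmax(samples))

# Tiny usage example with 3 Bernoulli arms.
true_means = np.array([0.3, 0.5, 0.7])
successes = np.zeros(3)
failures = np.zeros(3)
rng = np.random.default_rng(0)
for t in range(1000):
    arm = thompson_step(successes, failures, rng)
    reward = rng.random() < true_means[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward
```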

    Sequential Principal-Agent Problems with Communication: Efficient Computation and Learning

    Full text link
    We study a sequential decision making problem between a principal and an agent with incomplete information on both sides. In this model, the principal and the agent interact in a stochastic environment, and each is privy to observations about the state not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The principal and the agent communicate their signals to each other and select their actions independently based on this communication. Each player receives a payoff based on the state and their joint actions, and the environment moves to a new state. The interaction continues over a finite time horizon, and both players act to optimize their own total payoffs over the horizon. Our model encompasses as special cases stochastic games of incomplete information and POMDPs, as well as sequential Bayesian persuasion and mechanism design problems. We study both computation of optimal policies and learning in our setting. While the general problems are computationally intractable, we study algorithmic solutions under a conditional independence assumption on the underlying state-observation distributions. We present a polynomial-time algorithm to compute the principal's optimal policy up to an additive approximation. Additionally, we show an efficient learning algorithm for the case where the transition probabilities are not known beforehand. The algorithm guarantees sublinear regret for both players.
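    As a rough illustration of the interaction protocol described above (private observations, exchanged signals, independent actions, a shared state transition), the following sketch walks through one episode. The toy environment and policies are entirely hypothetical; computing the principal's optimal commitment, which is the paper's actual contribution, is not attempted here.

```python
import random

class ToyEnv:
    """Hypothetical two-state environment; each player gets a noisy private observation."""
    def reset(self):
        return 0
    def observe(self, state):
        # Independent noisy readings of the state for principal and agent.
        return (state if random.random() < 0.8 else 1 - state,
                state if random.random() < 0.8 else 1 - state)
    def step(self, state, a_p, a_a):
        # Payoffs depend on the state and the joint action; the state then moves.
        r_p = 1.0 if a_p == state else 0.0
        r_a = 1.0 if a_a == state else 0.0
        return r_p, r_a, random.randint(0, 1)

class TruthfulPolicy:
    """Hypothetical policy: signal the private observation, act on the other's signal."""
    def signal(self, t, obs):
        return obs
    def act(self, t, obs, other_signal):
        return other_signal

def run_episode(env, principal_policy, agent_policy, horizon):
    """Sketch of the interaction protocol from the abstract (not the paper's algorithm)."""
    state = env.reset()
    total_p = total_a = 0.0
    for t in range(horizon):
        obs_p, obs_a = env.observe(state)            # private observations
        msg_p = principal_policy.signal(t, obs_p)    # communication step
        msg_a = agent_policy.signal(t, obs_a)
        a_p = principal_policy.act(t, obs_p, msg_a)  # independent actions
        a_a = agent_policy.act(t, obs_a, msg_p)
        r_p, r_a, state = env.step(state, a_p, a_a)
        total_p += r_p
        total_a += r_a
    return total_p, total_a

print(run_episode(ToyEnv(), TruthfulPolicy(), TruthfulPolicy(), horizon=10))
```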

    Markov Decision Processes with Time-Varying Geometric Discounting

    Full text link
    Canonical models of Markov decision processes (MDPs) usually consider geometric discounting based on a constant discount factor. While this standard modeling approach has led to many elegant results, some recent studies indicate the necessity of modeling time-varying discounting in certain applications. This paper studies a model of infinite-horizon MDPs with time-varying discount factors. We take a game-theoretic perspective -- whereby each time step is treated as an independent decision maker with their own (fixed) discount factor -- and we study the subgame perfect equilibrium (SPE) of the resulting game as well as the related algorithmic problems. We present a constructive proof of the existence of an SPE and demonstrate the EXPTIME-hardness of computing an SPE. We also turn to the approximate notion of \epsilon-SPE and show that an \epsilon-SPE exists under milder assumptions. We present an algorithm to compute an \epsilon-SPE and provide an upper bound on its time complexity as a function of the convergence property of the time-varying discount factor. Comment: 24 pages, 3 figures.
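    To make the role of a time-varying discount factor concrete, the sketch below runs finite-horizon backward induction in which each step t applies its own \gamma_t to the continuation value. This is only an assumed illustration of the modified Bellman backup; the paper's SPE and \epsilon-SPE constructions treat each step as a separate player and are more involved.

```python
import numpy as np

def backward_induction(P, R, gammas):
    """Finite-horizon backward induction with a per-step discount factor.

    P: (A, S, S) transition probabilities, R: (A, S) immediate rewards,
    gammas: length-T sequence of discount factors, one per time step.
    Returns a greedy policy of shape (T, S) and values of shape (T+1, S).
    """
    A, S, _ = P.shape
    T = len(gammas)
    V = np.zeros((T + 1, S))
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        # Q[a, s] = R[a, s] + gamma_t * sum_{s'} P[a, s, s'] * V[t+1][s']
        Q = R + gammas[t] * (P @ V[t + 1])
        policy[t] = Q.argmax(axis=0)
        V[t] = Q.max(axis=0)
    return policy, V

# Toy example: 2 actions, 2 states, discount factors that decay over time.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.5, 0.5]])
gammas = [0.9 / (1 + 0.1 * t) for t in range(20)]
policy, V = backward_induction(P, R, gammas)
print(policy[0], V[0])
```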

    A randomized controlled trial of vaginal misoprostol tablet and intracervical dinoprostone gel in labor induction of women with prolonged pregnancies

    Get PDF
    Background: The objective of the study was to compare the efficacy of vaginal misoprostol and intracervical dinoprostone gel for induction of labor in women with an unfavorable cervix beyond 41 weeks (287 days) of gestation. Methods: This randomized controlled trial was performed at a teaching hospital between January 2011 and December 2012 and enrolled 192 women with singleton uncomplicated pregnancies, no previous uterine scar, and no spontaneous labor by the 288th day of gestation. Misoprostol (25 mcg tablet) was placed in the posterior vaginal fornix every four hours, up to a maximum of six doses, or dinoprostone (0.5 mg gel) was instilled intracervically every six hours, up to a maximum of three doses. Oxytocin was administered if needed. Primary outcomes: induction-delivery interval (IDI), with the incidence of delivery within 12 hours and within 24 hours, and mode of delivery (vaginal or caesarean section). Secondary outcomes: maternal side effects and neonatal outcome. Statistical analysis used the chi-square test, Student's t-test, and P-value determination. Results: The mean IDI was shorter in the misoprostol group than in the dinoprostone group (p<0.05). Adverse neonatal outcome (5-minute Apgar score) was comparable between the two groups (p>0.05). Conclusions: Vaginal misoprostol tablet is a safe and more effective method of induction of labour than intracervical dinoprostone gel in prolonged pregnancies.

    Adversarial blocking bandits

    Get PDF
    We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods and the reward per arm is given at each time period adversarially, without obeying any distribution. The setting models scenarios of allocating scarce limited supplies (e.g., arms) where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play per round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variance of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking durations and bounded path variation. In particular, when the variation budget B_T is known in advance, RGA achieves O(\sqrt{T(2\tilde{D}+K)B_{T}}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret is at least \Theta(T). Finally, we show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
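    For the optimization setting with known rewards and blocking durations, the greedy rule analysed above simply plays the best currently available arm each round. A minimal sketch follows; the array shapes and the exact blocking convention (an arm played at round t becomes available again at round t + D_k) are assumptions made for illustration, not the paper's precise model.

```python
import numpy as np

def greedy_blocking(rewards, durations):
    """Greedy oracle for the blocking bandit optimisation setting.

    rewards: (T, K) array of per-round, per-arm rewards (known in advance here).
    durations: length-K array; playing arm k at round t blocks it until t + durations[k].
    Plays the best available arm each round and returns the total reward and the play sequence.
    """
    T, K = rewards.shape
    available_at = np.zeros(K, dtype=int)  # first round each arm can be played again
    total = 0.0
    plays = []
    for t in range(T):
        avail = np.flatnonzero(available_at <= t)
        if avail.size == 0:
            plays.append(None)  # every arm is blocked this round
            continue
        k = avail[np.argmax(rewards[t, avail])]
        plays.append(int(k))
        total += rewards[t, k]
        available_at[k] = t + durations[k]
    return total, plays

# Toy usage: 3 arms, 10 rounds, arm 2 has the longest blocking duration.
rng = np.random.default_rng(0)
rewards = rng.random((10, 3))
durations = np.array([1, 2, 4])
print(greedy_blocking(rewards, durations))
```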